Fix some threading issues (some free-threading specific)#2162
Conversation
| if _can_use_structured_sm_split(): | ||
| return _split_with_general_api(self, opts, dry_run) | ||
| # SplitByCount requires the same 12.4+ as green ctx support (already checked above) | ||
| return _split_with_count_api(self, opts, dry_run) |
There was a problem hiding this comment.
Pointing this out inline, because this is probably the thing I am least sure about. Should SMResource.split() be thread-safe if called on the same resource from multiple threads?
|
This should probably be a separate PR, but are there best practices for CI to help ensure these sorts of issues don't come back? Should we be running |
Yeah, I just thought I'd split out this PR first and get it in and then open a PR for that. Running (ideally we use |
This fixes a few threading issues, but we may want to discuss some details still. * The GraphNode cleanup order is an important fix. Another thread may end up with the same pointer (but new object) as soon as we clean it up. So we have to remove it from the cache before cleaning it up. * Use of atomics: I think this is needed, but for this one place an atomic seemed more reasonable. (However, hard to test and if it can fail IIUC only on ARM.) * The critical sections should be pretty safe. I am not sure they will all ensure that the object is always the _identity_ but I am pretty sure it protects from worse races. (Testing did find this for MemPool.attributes, not others yet. Testing with thread-sanitizer might flush out some...) * The split mutex: This is thread-unsafe. But I am honestly not sure if that isn't just expected, or whether the mutex is good but it should also be safe from within CUDA. * Use of `setdefault` cached pattern is largely just normalizing. Without the `return dict.setdefault` a different instance may be returned on different threads (or a cache entry replaced). For the `cyGraphMemoryResource` that triggered a test with pytest-run-parallel although that doesn't mean it is problematic as such. `cuda-pathfinder` uses functools.cache, but usually for strings; the one we may want to look at is `load_nvidia_dynamic_lib`. Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
…s good... Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
|
It would be great if at least the critical section and atomics changes were broken out into a separate PR. |
|
OK, I split out most things and will update after that (or split out the final chunks), until then marking as draft. What is remaining is the GraphNode cleanup order and the mutex for |
Andy-Jost
left a comment
There was a problem hiding this comment.
Just commenting (not approving) because this PR is marked draft (I think it will be split).
I'm fine with these fixes as they are, though I'd also like to explore more structural fixes to decrease the likelihood problems are reintroduced. None of that blocks this change for me. Thematically, I think it amounts to moving more things into the C++ layer. Some ideas below from my bot (where Level 1 and Level 2 refer to the registry levels documented in REGISTRY_DESIGN.md):
There are more structural options, and they're largely aligned with where the handle layer was already heading. This PR is tactical patching; the strategic fix is centralize lifetime + identity + lazy state in C++ boxes/factories, and make Cython thin.
- Push lazy-init into Boxes (replaces atomics/critical sections for CUDA metadata)
A lot of what got@cython.critical_section / atomics is really "query driver once per handle": Buffer mem attrs, pool attrs, kernel attrs. That belongs in the box next to the resource, withstd::once_flagor a box mutex:
struct DevicePtrBox {
CUdeviceptr resource;
mutable MemAttrs attrs;
mutable std::once_flag attrs_init;
};
Cython properties become return box->get_attrs() — no Python-side check-then-act. This matches the existing handle design: state lives with ownership, not with the Python wrapper.
- Unified C++ factory + destroy protocol (fixes L1 ordering in one place)
create_graph_node_handlealready owns Level 1. Extend it to own the full lifecycle:
create_graph_node(CUgraphNode, GraphHandle) → GraphNodeHandle + Python wrapper
destroy_graph_node(GraphNodeHandle) → { pop L2, unregister L1, null resource, cuGraphDestroyNode }
All under one C++ mutex. Then Cython GraphNode.destroy() is a one-liner. The invalidate-then-destroy rule becomes impossible to get wrong at call sites.
- Templatized Level-2 / as_py identity cache in C++
REGISTRY_DESIGN.md already cites pybind11's registered_instances. A generic:
template<typename Handle>
class PyIdentityRegistry {
// key: owner identity (owner_before / box pointer)
// value: PyObject* via weakref
};
could serve GraphNode, Kernel, Event, etc. Key on handle ownership, not raw CU*. Registration happens in the factory; unregistration in the destroy protocol. Cython stops maintaining _node_registry by hand.
as_py(h) in DESIGN.md was meant to return Python wrappers — today GraphNode's as_py still goes to cuda.bindings. Evolving it to return the cached cuda.core object would collapse L2 into the handle layer.
- What you still can't fully move out of Cython
• Polymorphic construction: GN_create_impl dispatches to EmptyNode, KernelNode, … C++ either needs a typed factory table or a Cython callback registered at init — still one entry point, but not zero Cython.
• Pure-Python lazy caches: IPCDataForBuffer, DLPack/StridedMemoryView inference, ComputeCapability tuples — these aren't handle metadata; they'll need Python locks or redesign.
• APIs that aren't handle-backed: SMResource.split() — mutex or a documented "caller serializes" contract unless the driver guarantees more.
|
Not sure I am following all of it, but C++ threading does seem much more mature than Python/Cython so pushing more to C++ is probably a good idea.
|
This fixes a few threading issues, but we may want to discuss some details still.
(A critical section would work as well, since it is a lock which enforces order.)
pytest-run-paralleldid find this forMemPool.attributesand I could not reproduce after adding it. The other critical sections are added with a code-search, though.setdefaultcached pattern is largely just normalizing. Without thereturn dict.setdefaulta different instance may be returned on different threads (or a cache entry replaced). For thecyGraphMemoryResourcethat triggered a test with pytest-run-parallel although that doesn't mean it is problematic as such.cuda-pathfinderuses functools.cache, but usually for strings; the one we may want to look at isload_nvidia_dynamic_lib.Checklist
Implementing gh-2114 would add implicit coverage for most but the atomics and some of the critical sections (if it adds it might be so rare to effectively never trigger). So this is split out.
(If preferred I can add some explicitly threaded tests.)
I have no full runs with pytest-run-parallel, but I want to clean up the test harness a bit more. I also still should test without free-threading (useful to flush out forgotten
with nogil:blocks that can cause deadlocks).